Enterprise Database Systems
Accessing Data with Spark
Accessing Data with Spark: An Introduction to Spark
Accessing Data with Spark: Data Analysis Using Spark SQL
Accessing Data with Spark: Data Analysis Using the Spark DataFrame API

Accessing Data with Spark: An Introduction to Spark

Course Number:
it_dsadskdj_01_enus
Lesson Objectives

Accessing Data with Spark: An Introduction to Spark

  • recognize where Spark fits in the Hadoop ecosystem and how it relates to Hadoop's components
  • describe Spark RDDs and their characteristics, including what makes them resilient and distributed
  • identify the types of operations that are permitted on an RDD and describe how RDD transformations are lazily evaluated
  • distinguish between RDDs and DataFrames and describe the relationship between the two
  • list the crucial components of Spark and the relationships between them, and recognize the functions of the Spark session and the master and worker nodes
  • install PySpark and initialize a Spark Context (a combined sketch of these hands-on objectives follows this list)
  • create and load data into an RDD
  • initialize a Spark DataFrame from the contents of an RDD
  • work with Spark DataFrames containing both primitive and structured data types
  • define the contents of a DataFrame using the SQLContext
  • apply the map() function on an RDD to configure a DataFrame with column headers
  • retrieve required data from within a DataFrame and define and apply transformations on a DataFrame
  • convert Spark DataFrames to Pandas DataFrames and vice versa
  • describe basic Spark concepts
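
The hands-on objectives above can be pictured with a minimal PySpark sketch; the sample records and column names below are illustrative, not course material:

```python
from pyspark.sql import Row, SparkSession

# SparkSession is the single entry point in Spark 2.x and later;
# the SparkContext is available through it.
spark = SparkSession.builder.appName("spark-intro").getOrCreate()
sc = spark.sparkContext

# Create an RDD from an in-memory collection.
rdd = sc.parallelize([("alice", 34), ("bob", 28), ("carol", 41)])

# Apply map() to attach column headers, then build a DataFrame.
rows = rdd.map(lambda t: Row(name=t[0], age=t[1]))
df = spark.createDataFrame(rows)

# Transformations are lazy; show() is an action that triggers them.
df.filter(df.age > 30).show()

# Convert to a Pandas DataFrame and back again.
pdf = df.toPandas()
df2 = spark.createDataFrame(pdf)
```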

Overview/Description

Explore the basics of Apache Spark, an analytics engine for working with big data that can run on top of Hadoop. Discover how it allows operations on data through both its own library methods and SQL, while its in-memory processing model delivers strong performance.
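
As a taste of the two interfaces, the same question can be asked with a DataFrame method chain or with SQL; this reuses the hypothetical `spark` and `df` from the sketch above:

```python
# DataFrame method chain:
df.groupBy("name").count().show()

# Equivalent SQL, after exposing the DataFrame as a temporary view:
df.createOrReplaceTempView("people")
spark.sql("SELECT name, COUNT(*) AS n FROM people GROUP BY name").show()
```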



Target Audience

Prerequisites: none

Accessing Data with Spark: Data Analysis Using Spark SQL

Course Number:
it_dsadskdj_03_enus
Lesson Objectives

Accessing Data with Spark: Data Analysis Using Spark SQL

  • recall the different stages involved in optimizing any query or method call on the contents of a Spark DataFrame
  • create views out of a Spark DataFrame's contents and run queries against them (see the sketch after this list)
  • trim and clean a DataFrame before a view is created as a precursor to running SQL queries on it
  • perform an analysis of data by running different kinds of SQL queries, including grouping and aggregations
  • recognize how Spark DataFrames infer the schema of data loaded into them and configure a DataFrame with an explicitly defined schema
  • define what a window is in the context of Spark DataFrames and when one can be used
  • create and analyze categories of data in a dataset using windows
  • analyze data using Spark SQL
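
A combined sketch of the view, explicit-schema, and SQL-query objectives above; the file name, columns, and types are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("spark-sql").getOrCreate()

# Supply an explicit schema rather than letting Spark infer one.
schema = StructType([
    StructField("region", StringType(), True),
    StructField("product", StringType(), True),
    StructField("revenue", DoubleType(), True),
])
df = spark.read.csv("sales.csv", header=True, schema=schema)

# Trim and clean the DataFrame before exposing it as a view.
clean = df.dropna(subset=["region", "revenue"])
clean.createOrReplaceTempView("sales")

# Grouping and aggregation through plain SQL.
spark.sql("""
    SELECT region, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY region
    ORDER BY total_revenue DESC
""").show()
```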

Overview/Description

Analyze a Spark DataFrame by treating it as though it were a relational database table. Discover how to create a view from a Spark DataFrame and run SQL queries against it, and how to define windows and use them to explore data.
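
The windows referred to here are SQL-style window functions, not an operating system; a minimal sketch, reusing the hypothetical `clean` sales DataFrame from the sketch above:

```python
from pyspark.sql import Window
from pyspark.sql.functions import col, rank

# Rank products by revenue within each region.
w = Window.partitionBy("region").orderBy(col("revenue").desc())
(clean.withColumn("rank_in_region", rank().over(w))
      .filter(col("rank_in_region") <= 3)
      .show())
```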



Target Audience

Prerequisites: none

Accessing Data with Spark: Data Analysis Using the Spark DataFrame API

Course Number:
it_dsadskdj_02_enus
Lesson Objectives

Accessing Data with Spark: Data Analysis Using the Spark DataFrame API

  • recognize the features that make Spark 2.x versions significantly faster than Spark 1.x
  • specify the reasons for using shared variables in your Spark application and distinguish between the two options available for sharing variables: broadcast variables and accumulators
  • create a Spark DataFrame from the contents of a CSV file and apply some simple transformations on the DataFrame (see the sketch after this list)
  • define a transformation to view a random sample of data from a large DataFrame
  • apply grouping and aggregation operations on a DataFrame to analyze categories of data in a dataset
  • use Matplotlib to visualize the contents of a Spark DataFrame
  • perform operations to prepare your dataset for analysis by trimming unnecessary columns and rows containing missing data
  • define and apply a generic transformation on a DataFrame
  • apply complex transformations on a DataFrame to extract meaningful information from a dataset
  • work with broadcast variables and perform a join operation with a DataFrame that has been broadcast
  • use a Spark accumulator as a counter
  • store the contents of a DataFrame in a text file for archiving or sharing
  • define and work with a custom accumulator to count a vector of values
  • perform different join operations on Spark DataFrames to combine data from multiple sources
  • analyze data using the DataFrame API
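
A sketch of the loading, cleaning, sampling, and aggregation objectives above; the file and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, count

spark = SparkSession.builder.appName("df-api").getOrCreate()

# Load a CSV and let Spark infer the schema.
df = spark.read.csv("ratings.csv", header=True, inferSchema=True)

# Trim unnecessary columns and drop rows with missing data.
df = df.select("user_id", "genre", "rating").dropna()

# View a random ~1% sample of a large DataFrame.
df.sample(withReplacement=False, fraction=0.01, seed=42).show()

# Grouping and aggregation on a category column.
df.groupBy("genre") \
  .agg(avg("rating").alias("avg_rating"), count("*").alias("n_ratings")) \
  .show()
```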

Overview/Description

Explore how to analyze real datasets using Spark DataFrame API methods. Discover how to optimize operations using shared variables and how to combine data from multiple DataFrames using joins.
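
The shared variables are broadcast variables and accumulators; a sketch of both, plus a broadcast join and an archival write, reusing the hypothetical `spark` and `df` from the sketch above:

```python
from pyspark.sql.functions import broadcast

# A small lookup table, broadcast so the join avoids a shuffle.
genres = spark.createDataFrame(
    [("scifi", "Science Fiction"), ("doc", "Documentary")],
    ["genre", "genre_name"],
)
joined = df.join(broadcast(genres), on="genre", how="left")

# An accumulator used as a counter across the cluster.
unmatched = spark.sparkContext.accumulator(0)

def count_unmatched(row):
    # Rows whose genre was absent from the lookup table.
    if row["genre_name"] is None:
        unmatched.add(1)

joined.foreach(count_unmatched)
print("rows with no matching genre:", unmatched.value)

# Store the result as text (CSV) files for archiving or sharing.
joined.write.mode("overwrite").csv("joined_out", header=True)
```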



Target Audience

Prerequisites: none
